Anthropic’s 2026 report says developers use AI in 60% of their work but can fully delegate only 0-20% of tasks [1]. That’s a massive gap. I dug into the public research to understand why — and the answer isn’t what most people assume.
As someone who uses AI coding agents daily, I assumed the bottleneck was model capability. It’s not. The research points to something more fundamental.
The Conversation Depth Problem
Most AI coding conversations are shockingly short. That is what three large public datasets of real developer-AI interactions show [2][3][4].
This tells us something important: most developers are using AI for quick lookups and one-shot code generation, not for sustained collaboration on complex tasks. The “AI pair programmer” narrative doesn’t match how people actually use these tools.
This matches my own experience. I use AI coding agents daily — for everything from debugging production incidents to multi-day feature migrations. But when I look honestly at how I use them, the majority of my sessions are short: a quick lookup, a one-shot code generation, a “what does this error mean?” And the sessions that are long tend to be long not because they’re productive, but because I’m afraid to close them.
78% of AI Failures Are Invisible
A Stanford/Bigspin study analyzed 196,000 real ChatGPT conversations and found that 78% of failures were invisible: the user never noticed that the AI had gotten it wrong [5].
I’ve caught myself doing this. An AI agent generates a plausible-looking fix, I skim it, it compiles, I move on. It’s only later — sometimes days later — that I realize it silently broke an edge case or introduced a subtle regression. The failure wasn’t loud. It was fluent.
The researchers found that 94% of these invisible failures would persist with a more capable model [5]. The #1 pattern (79% of failures): the model generates fluent output instead of asking for clarification.
The Eight Failure Archetypes
The study identified eight recurring patterns of invisible failure [5].
Software development was the only domain with a high rate of visible failures — developers push back when code doesn’t work. In creative writing, education, and general knowledge, failures go almost entirely undetected.
The Vicious Cycle
Martin Fowler’s team published “Context Anchoring” in March 2026 [6], describing a dynamic that anyone who’s used an AI coding agent will recognize: developers keep conversations running far longer than they should — not because long sessions are productive, but because closing the session means losing everything.
They call it a vicious cycle. The context lives only in the chat. There’s no external record. So the conversation stretches on while the AI’s ability to recall earlier decisions quietly degrades. And when you finally close the session, you start from zero.
I’ve lived this cycle. I once ran a multi-day code migration that spawned dozens of sessions with an AI agent. Each new session started with me re-explaining the same architecture, the same constraints, the same decisions we’d already made. The agent was helpful within each session — but it couldn’t carry anything across them. I was the memory. And I’m not a great database.
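One lightweight way to stop being the memory yourself is to externalize decisions into a file the agent reads at the start of every session. Here is a minimal sketch; the `DECISIONS.md` file name and the entry format are my own convention, not something prescribed by Fowler's pattern:

```python
from datetime import date
from pathlib import Path

# Hypothetical convention: one append-only decision log per repository.
LOG = Path("DECISIONS.md")

def record_decision(decision: str, rationale: str) -> None:
    """Append a dated decision so future sessions can be bootstrapped from it."""
    entry = f"- {date.today().isoformat()}: {decision} (why: {rationale})\n"
    with LOG.open("a", encoding="utf-8") as f:
        f.write(entry)

def session_preamble(limit: int = 20) -> str:
    """Build the context block to paste (or auto-inject) at the start of a new session."""
    if not LOG.exists():
        return "No prior decisions recorded."
    lines = LOG.read_text(encoding="utf-8").splitlines()[-limit:]
    return "Decisions made in earlier sessions:\n" + "\n".join(lines)
```

The point is not the tooling; it is that the decision history lives outside the chat, so closing a session no longer means losing it.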
Context Rot Is Real — and Measured
This isn’t just a feeling. Researchers have now quantified it. A study evaluating Claude 3.7, GPT-4.1, Llama 4, and o4-mini on long-context tasks found that success rates drop from 40-50% to less than 10% as context length increases [7]. The phenomenon has a name: context rot.
Longer context windows don’t fix this. A 1M-token window just means you can accumulate more noise before the model collapses.
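You can’t prevent context rot, but you can delay it by keeping the working set small: pin a summary of the decisions that matter and drop the stale middle of the transcript first, since models recall the start and end of a long context better than the middle. A toy sketch, using word count as a crude stand-in for a real tokenizer:

```python
def trim_context(messages: list[str], pinned_summary: str,
                 budget_words: int = 500) -> list[str]:
    """Keep the pinned summary plus as many *recent* messages as fit the budget.

    Word count is a naive proxy for tokens; a real implementation would use
    the model's tokenizer. Older messages are dropped first.
    """
    kept: list[str] = []
    used = len(pinned_summary.split())
    for msg in reversed(messages):          # walk newest-first
        cost = len(msg.split())
        if used + cost > budget_words:
            break
        kept.append(msg)
        used += cost
    # Restore chronological order, with the pinned summary up front.
    return [pinned_summary] + list(reversed(kept))
```

A 1M-token window makes this less urgent, not unnecessary: the budget just moves.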
Programming Is Now 50% of All LLM Usage
The OpenRouter 100-trillion-token study (covering 5M+ developers across 300+ models) shows the scale of this problem: programming now accounts for roughly half of all LLM usage [9].
Meanwhile, the AIDev dataset shows that 932,000 agent-authored pull requests have already landed on GitHub from tools like Codex, Devin, Copilot, Cursor, and Claude Code [10] — and their code is accepted less frequently than human-authored code, revealing a persistent trust gap.
The Trust Gap in Agent-Authored Code
That AIDev dataset deserves a closer look.
The delegation ceiling isn’t 0-20% because models are dumb. It’s because the tasks that matter most — multi-file features, architectural decisions, long-running migrations — are exactly the tasks where context rot, trust gaps, and interaction failures compound.
A Delegation Decision Framework
Based on the research above, here’s my practical framework for when to delegate vs supervise:
```mermaid
flowchart TD
    A["Task"] --> B{"Multi-session?"}
    B -->|No| C{"Structured input/output?"}
    C -->|Yes| D["✅ Full Delegation<br/>Status checks, searches, lookups"]
    C -->|No| E{"Creative/low-risk?"}
    E -->|Yes| F["✅ Full Delegation<br/>Writing, research, recommendations"]
    E -->|No| G["👀 Supervised<br/>Single-session code changes, bug fixes"]
    B -->|Yes| H{"Can you externalize context?"}
    H -->|Yes| I["👀 Supervised + Context Doc<br/>Use Fowler's context anchoring pattern"]
    H -->|No| J["⚠️ Heavy Supervision<br/>Expect restarts. Budget for re-explanation."]
    style D fill:#22c55e,color:#fff
    style F fill:#22c55e,color:#fff
    style G fill:#3b82f6,color:#fff
    style I fill:#3b82f6,color:#fff
    style J fill:#ef4444,color:#fff
```
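The same framework can be written as a small function. The exact branch questions are my paraphrase of the flowchart above, so treat the wording as illustrative:

```python
def delegation_mode(multi_session: bool,
                    structured: bool = False,
                    low_risk: bool = False,
                    can_externalize_context: bool = False) -> str:
    """Map a task's properties to a supervision level, per the framework above."""
    if multi_session:
        if can_externalize_context:
            return "supervised + context doc"   # Fowler-style context anchoring
        return "heavy supervision"              # expect restarts and re-explanation
    if structured:
        return "full delegation"                # status checks, searches, lookups
    if low_risk:
        return "full delegation"                # writing, research, recommendations
    return "supervised"                         # single-session code changes, bug fixes
```

For example, `delegation_mode(True, can_externalize_context=True)` returns `"supervised + context doc"`: a multi-day migration is workable only if the context lives outside the chat.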
What I’ve Learned Using AI Agents Daily
I’ve been using AI coding agents as my primary development tool for months now. Not as a novelty — as a daily driver for production work. Here’s what I’ve noticed:
The tasks I delegate most successfully are the ones I’d describe as “boring.” Searching tickets, checking pipeline status, looking up metrics, generating boilerplate. These are structured-input, structured-output tasks. The AI handles them perfectly because there’s no ambiguity and no accumulated context to lose.
Writing and research delegate surprisingly well. This blog post is a good example — I used an AI agent to help find papers, extract data, build charts, and iterate on drafts. The key difference from coding: writing is low-stakes to review. I can read a paragraph and know instantly if it’s wrong. I can’t always do that with code.
Incident response is where supervised collaboration shines. During a production incident, I provide the context (what service, what symptoms, what changed) and the AI investigates — pulling logs, querying metrics, suggesting hypotheses. Neither of us could do it as fast alone. But I’d never let the AI run an incident unsupervised.
Multi-day feature work is where everything breaks down. I’ve had migrations that spanned weeks of sessions. Each new session, I’d spend the first 10-15 minutes re-explaining the architecture, the constraints, the decisions we’d already made. The agent was productive within each session but had zero memory across them. I was the continuity layer — and I’m lossy.
The biggest surprise: I delegate non-coding tasks more successfully than coding tasks. Document review, research synthesis, writing drafts, even analyzing lease agreements. These tasks have clear success criteria and are easy to verify. Code, paradoxically, is harder to delegate because failures are silent and verification requires running the code, not just reading it.
What Needs to Change
The delegation gap won’t close with better models. It’ll close with better interaction design. The good news: the industry is already moving.
- Session memory is evolving fast. The first generation was static rules files: Cursor’s `.cursorrules`, Copilot’s `instructions.md`, Claude Code’s `CLAUDE.md`. Useful, but they’re “here’s my preferences,” not “here’s what we decided.” The next generation is more interesting: Claude Code now has auto-memory that accumulates project knowledge across sessions without you writing anything down. Google’s Antigravity generates its own context artifacts (task checklists, implementation plans, walkthroughs) that persist as the agent’s memory for future sessions. Anthropic’s Cowork gives agents persistent workspaces with dedicated storage, scheduled tasks, and project-scoped memory that survives between sessions. These are real steps toward Fowler’s context anchoring vision [6], but they’re still early. The gap between “remembers my code style” and “remembers why we chose this architecture three sessions ago” remains wide.
- Failure detection: 78% of failures are invisible because the AI never signals uncertainty [5]. We need proactive drift detection, not just confident-sounding output.
- Clarification over generation: The #1 failure pattern (79% of cases) is the model generating fluent output instead of asking for clarification [5]. Models that say “did you mean X or Y?” will outperform models that guess.
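A clarification-first loop doesn’t require a smarter model; it requires a gate in front of generation. A toy sketch: before answering, check the request for obvious underspecification and return a question instead of a guess. The keyword markers here are deliberately naive placeholders; a real system would estimate ambiguity from the model itself, for example by sampling divergent interpretations:

```python
# Naive placeholder markers for an underspecified request.
VAGUE_MARKERS = ("somehow", "something", "stuff", "fix it", "make it work")

def respond(request: str,
            generate=lambda r: f"[draft answer for: {r}]") -> str:
    """Route obviously underspecified requests to a clarifying question
    instead of fluent output. `generate` stands in for the actual model call."""
    text = request.lower()
    underspecified = len(text.split()) < 4 or any(m in text for m in VAGUE_MARKERS)
    if underspecified:
        return "Clarify first: which file or behavior do you mean, and what does success look like?"
    return generate(request)
```

Even this crude gate changes the failure mode: the worst case becomes an unnecessary question, not a fluent wrong answer.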
The model isn’t the bottleneck. The session is. But the session is getting smarter.
References
1. Anthropic, “Eight Trends Defining How Software Gets Built in 2026”, 2026.
2. LMSYS, “Chatbot Arena Conversations”, HuggingFace (33K multi-turn conversations).
3. Zhao et al., “WildChat: 1M ChatGPT Interaction Logs in the Wild”, 2024 (50K sample analyzed).
4. Xiao et al., “DevGPT: Studying Developer-ChatGPT Conversations”, 2023 (3.8K GitHub-linked conversations).
5. Potts & Sudhof, “Invisible Failures in Human-AI Interactions”, Stanford/Bigspin, 2026 (196K annotated ChatGPT transcripts).
6. Fowler et al., “Context Anchoring”, martinfowler.com, March 2026.
7. Levy et al., “Long-Context Reasoning Degradation in Web Agents”, 2025 (context rot measured across Claude 3.7, GPT-4.1, Llama 4, o4-mini).
8. Liu et al., “Lost in the Middle: How Language Models Use Long Contexts”, Stanford, 2023.
9. Willison et al., “State of AI: 100 Trillion Token Study”, OpenRouter/a16z, 2025 (5M+ developers, 300+ models).
10. Murali et al., “AIDev: AI Coding Agents on GitHub”, 2026 (932K agent-authored pull requests).
11. Jain et al., “AI Coding Assistant Benchmark”, 2025 (Grok 4: 69.3%, Claude Opus 4: 68.5%, GPT-5: 67.8%).
12. Mundler et al., “SWE-Dev: Evaluating LLMs on Real Feature Development”, 2025 (14K feature dev tasks; Claude 3.7 Sonnet 22% Pass@3 on hard split).
No confidential or proprietary information is disclosed in this post.